Poor starting points in machine learning
Poor (even random) starting points for learning/training/optimization are
common in machine learning. In many settings, the method of Robbins and Monro
(online stochastic gradient descent) is known to be optimal for good starting
points, but may not be optimal for poor starting points -- indeed, for poor
starting points Nesterov acceleration can help during the initial iterations,
even though Nesterov methods that were not designed for stochastic approximation could
hurt during later iterations. The common practice of training with nontrivial
minibatches enhances the advantage of Nesterov acceleration.
Comment: 11 pages, 3 figures, 1 table; this initial version is literally
identical to that circulated among a restricted audience over a month ago
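To make the iteration concrete, below is a minimal sketch of minibatch stochastic gradient descent with Nesterov momentum; the gradient oracle grad_fn and all hyperparameter values are hypothetical placeholders, and the sketch illustrates the general technique rather than the paper's exact experiments.

```python
import numpy as np

def nesterov_sgd(grad_fn, x0, data, lr=0.1, momentum=0.9,
                 batch_size=32, epochs=10, rng=None):
    """Minibatch SGD with Nesterov momentum (illustrative sketch).

    grad_fn(x, batch) is assumed to return the gradient of the loss
    at x, averaged over the rows of the given minibatch.
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    n = len(data)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            # Nesterov: evaluate the gradient at the looked-ahead point.
            g = grad_fn(x + momentum * v, data[idx])
            v = momentum * v - lr * g
            x = x + v
    return x
```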
Testing the significance of assuming homogeneity in contingency-tables/cross-tabulations
The model for homogeneity of proportions in a two-way
contingency-table/cross-tabulation is the same as the model of independence,
except that the probabilistic process generating the data is viewed as fixing
the column totals (but not the row totals). When gauging the consistency of
observed data with the assumption of independence, recent work has illustrated
that the Euclidean/Frobenius/Hilbert-Schmidt distance is often far more
statistically powerful than the classical statistics such as chi-square, the
log-likelihood-ratio (G), the Freeman-Tukey/Hellinger distance, and other
members of the Cressie-Read power-divergence family. The present paper
indicates that the Euclidean/Frobenius/Hilbert-Schmidt distance can be more
powerful for gauging the consistency of observed data with the assumption of
homogeneity, too.
Comment: 14 pages, 18 tables
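As a concrete illustration of the statistics being compared, the following sketch (our naming, assuming NumPy) computes both the classical chi-square statistic and the Euclidean/Frobenius distance between observed and expected counts under homogeneity; calibrating significance, e.g., by Monte Carlo simulation under the null, is omitted.

```python
import numpy as np

def homogeneity_stats(table):
    """Chi-square and Euclidean/Frobenius statistics for homogeneity in a
    two-way contingency table (illustrative sketch; the columns are the
    categories whose totals the sampling process fixes).
    """
    O = np.asarray(table, dtype=float)
    row = O.sum(axis=1, keepdims=True)
    col = O.sum(axis=0, keepdims=True)
    E = row @ col / O.sum()                 # expected counts under the null
    chi2 = ((O - E) ** 2 / E).sum()         # classical chi-square statistic
    euclid = np.sqrt(((O - E) ** 2).sum())  # Euclidean/Frobenius distance
    return chi2, euclid
```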
A fast algorithm for computing minimal-norm solutions to underdetermined systems of linear equations
We introduce a randomized algorithm for computing the minimal-norm solution
to an underdetermined system of linear equations. Given an arbitrary full-rank
m x n matrix A with m<n, any m x 1 vector b, and any positive real number
epsilon less than 1, the procedure computes an n x 1 vector x that
approximates, to relative precision epsilon or better, the n x 1 vector p of
minimal Euclidean norm satisfying Ap=b. The algorithm typically requires O(mn
log(sqrt(n)/epsilon) + m**3) floating-point operations, generally less than the
O(m**2 n) required by the classical schemes based on QR-decompositions or
bidiagonalization. We present several numerical examples illustrating the
performance of the algorithm.
Comment: 13 pages, 4 tables
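For reference, the vector p that the randomized algorithm approximates can be computed classically; the sketch below uses NumPy's SVD-based lstsq, which returns the minimal Euclidean-norm solution for underdetermined systems, at a cost on the order of the classical schemes the paper undercuts.

```python
import numpy as np

def minimal_norm_solution(A, b):
    """Classical reference computation of the minimal Euclidean-norm
    solution p to Ap = b for a full-rank m x n matrix A with m < n.
    NumPy's lstsq returns exactly this minimal-norm solution; the
    paper's randomized algorithm approximates it more cheaply.
    """
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# Tiny usage example on a random underdetermined system.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))
b = rng.standard_normal(50)
p = minimal_norm_solution(A, b)
assert np.allclose(A @ p, b)
```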
Regression-aware decompositions
Linear least-squares regression with a "design" matrix A approximates a given
matrix B via minimization of the spectral- or Frobenius-norm discrepancy
||AX-B|| over every conformingly sized matrix X. Another popular approximation
is low-rank approximation via principal component analysis (PCA) -- which is
essentially singular value decomposition (SVD) -- or interpolative
decomposition (ID). Classically, PCA/SVD and ID operate solely with the matrix
B being approximated, not supervised by any auxiliary matrix A. However, linear
least-squares regression models can inform the ID, yielding regression-aware
ID. As a bonus, this provides an interpretation as regression-aware PCA for a
kind of canonical correlation analysis between A and B. The regression-aware
decompositions effectively enable supervision to inform classical
dimensionality reduction, which classically has been totally unsupervised. The
regression-aware decompositions reveal the structure inherent in B that is
relevant to regression against A.
Comment: 19 pages, 9 figures, 2 tables
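One way to make "regression-aware" concrete: the rank-k matrix X minimizing ||AX-B|| in the Frobenius norm can be obtained by projecting B onto the column space of A, truncating the SVD of the projection, and pulling back through A. The sketch below (our naming; a standard rank-constrained-regression construction, not necessarily the paper's exact ID formulation) illustrates this.

```python
import numpy as np

def regression_aware_lowrank(A, B, k):
    """Rank-k matrix X minimizing ||A X - B|| in the Frobenius norm
    (illustrative sketch): project B onto col(A), take the best rank-k
    approximation there, then pull back through A.
    """
    Q, _ = np.linalg.qr(A)                  # orthonormal basis for col(A)
    PB = Q @ (Q.T @ B)                      # projection of B onto col(A)
    U, s, Vt = np.linalg.svd(PB, full_matrices=False)
    Mk = (U[:, :k] * s[:k]) @ Vt[:k]        # best rank-k approx in col(A)
    X, *_ = np.linalg.lstsq(A, Mk, rcond=None)
    return X                                # rank at most k
```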
Recurrence relations and fast algorithms
We construct fast algorithms for evaluating transforms associated with
families of functions which satisfy recurrence relations. These include
algorithms both for computing the coefficients in linear combinations of the
functions, given the values of these linear combinations at certain points,
and, vice versa, for evaluating such linear combinations at those points, given
the coefficients in the linear combinations; such procedures are also known as
analysis and synthesis of series of certain special functions. The algorithms
of the present paper are efficient in the sense that their computational costs
are proportional to n (ln n) (ln(1/epsilon))^3, where n is the amount of input
and output data, and epsilon is the precision of computations. Stated somewhat
more precisely, we find a positive real number C such that, for any positive
integer n > 10, the algorithms require at most C n (ln n) (ln(1/epsilon))^3
floating-point operations and words of memory to evaluate at n appropriately
chosen points any linear combination of n special functions, given the
coefficients in the linear combination, where epsilon is the precision of
computations.
Comment: 24 pages
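For a point of reference on what such transforms compute, below is the naive synthesis of a series of Chebyshev polynomials (a family satisfying a three-term recurrence) via the Clenshaw recurrence; evaluating at n points this way costs O(n^2) in total, the cost that the paper's algorithms reduce to roughly n (ln n) (ln(1/epsilon))^3.

```python
import numpy as np

def clenshaw_chebyshev(c, x):
    """Evaluate sum_k c[k] T_k(x), where T_k is the Chebyshev polynomial
    of degree k, via the Clenshaw recurrence; x is an array of points in
    [-1, 1]. This is the naive O(n)-per-point reference, not the fast
    algorithm of the paper.
    """
    x = np.asarray(x, dtype=float)
    b1 = np.zeros_like(x)
    b2 = np.zeros_like(x)
    for ck in c[:0:-1]:                     # c[n-1], ..., c[1]
        b1, b2 = ck + 2.0 * x * b1 - b2, b1
    return c[0] + x * b1 - b2
```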
A fast randomized algorithm for orthogonal projection
We describe an algorithm that, given any full-rank matrix A having fewer rows
than columns, can rapidly compute the orthogonal projection of any vector onto
the null space of A, as well as the orthogonal projection onto the row space of
A, provided that both A and its adjoint can be applied rapidly to arbitrary
vectors. As an intermediate step, the algorithm solves the overdetermined
linear least-squares regression involving the adjoint of A (and so can be used
for this, too). The basis of the algorithm is an obvious but numerically
unstable scheme; suitable use of a preconditioner yields numerical stability.
We generate the preconditioner rapidly via a randomized procedure that succeeds
with extremely high probability. In many circumstances, the method can
accelerate interior-point methods for convex optimization, such as linear
programming (Ming Gu, personal communication).
Comment: 13 pages, 6 tables
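The projections in question can be obtained from a single overdetermined least-squares solve involving the adjoint of A; the sketch below performs that solve with NumPy's dense lstsq, a slow but stable stand-in for the paper's randomized, preconditioned solver.

```python
import numpy as np

def project_row_and_null(A, v):
    """Orthogonal projections of v onto the row space and the null space
    of a full-rank A having fewer rows than columns (illustrative sketch).
    """
    # min_y ||A^T y - v|| makes A^T y the projection of v onto the
    # row space of A; the residual is the projection onto the null space.
    y, *_ = np.linalg.lstsq(A.T, v, rcond=None)
    v_row = A.T @ y
    v_null = v - v_row
    return v_row, v_null
```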
Testing goodness-of-fit for logistic regression
Explicitly accounting for all applicable independent variables, even when the
model being tested does not, is critical in testing goodness-of-fit for
logistic regression. This can increase statistical power by orders of
magnitude.
Comment: 13 pages, 4 tables
A comparison of the discrete Kolmogorov-Smirnov statistic and the Euclidean distance
Goodness-of-fit tests gauge whether a given set of observations is consistent
(up to expected random fluctuations) with arising as independent and
identically distributed (i.i.d.) draws from a user-specified probability
distribution known as the "model." The standard gauges involve the discrepancy
between the model and the empirical distribution of the observed draws. Some
measures of discrepancy are cumulative; others are not. The most popular
cumulative measure is the Kolmogorov-Smirnov statistic; when all probability
distributions under consideration are discrete, a natural noncumulative measure
is the Euclidean distance between the model and the empirical distributions. In
the present paper, both mathematical analysis and its illustration via various
data sets indicate that the Kolmogorov-Smirnov statistic tends to be more
powerful than the Euclidean distance when there is a natural ordering for the
values that the draws can take -- that is, when the data is ordinal -- whereas
the Euclidean distance is more reliable and more easily understood than the
Kolmogorov-Smirnov statistic when there is no natural ordering (or partial
order) -- that is, when the data is nominal.
Comment: 15 pages, 6 figures, 3 tables
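Concretely, for draws taking values 0, ..., m-1 under a discrete model, the two statistics can be computed as follows (a sketch with names of our choosing; calibrating significance, e.g., via Monte Carlo simulation under the model, is omitted).

```python
import numpy as np

def ks_and_euclidean(draws, model_probs):
    """Discrete Kolmogorov-Smirnov and Euclidean statistics comparing the
    empirical distribution of integer draws in {0, ..., m-1} with the
    model distribution model_probs (illustrative sketch).
    """
    m = len(model_probs)
    emp = np.bincount(draws, minlength=m) / len(draws)
    diff = emp - np.asarray(model_probs, dtype=float)
    ks = np.max(np.abs(np.cumsum(diff)))  # cumulative; order-dependent
    euclid = np.sqrt((diff ** 2).sum())   # noncumulative; order-free
    return ks, euclid
```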
An implementation of a randomized algorithm for principal component analysis
Recent years have witnessed intense development of randomized methods for
low-rank approximation. These methods target principal component analysis (PCA)
and the calculation of truncated singular value decompositions (SVD). The
present paper describes an essentially black-box, foolproof implementation for
MathWorks' MATLAB, a popular software platform for numerical computation. As
illustrated via several tests, the randomized algorithms for low-rank
approximation outperform or at least match the classical techniques (such as
Lanczos iterations) in basically all respects: accuracy, computational
efficiency (both speed and memory usage), ease-of-use, parallelizability, and
reliability. However, the classical procedures remain the methods of choice for
estimating spectral norms, and are far superior for calculating the least
singular values and corresponding singular vectors (or singular subspaces).
Comment: 13 pages, 4 figures
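The flavor of these randomized methods fits in a few lines: sketch the range of the matrix with a random test matrix, sharpen with a couple of power iterations, and compute an SVD in the reduced space. The Python sketch below illustrates this general family; it is not the paper's MATLAB code.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, rng=None):
    """Rank-k truncated SVD via a randomized range finder with power
    iterations (illustrative sketch of the general approach).
    """
    rng = rng or np.random.default_rng(0)
    G = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ G)              # approximate basis for range(A)
    for _ in range(n_iter):                 # power iterations sharpen Q
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U)[:, :k], s[:k], Vt[:k]
```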
Cumulative deviation of a subpopulation from the full population
Assessing equity in treatment of a subpopulation often involves assigning
numerical "scores" to all individuals in the full population such that similar
individuals get similar scores; matching via propensity scores or appropriate
covariates is common, for example. Given such scores, individuals with similar
scores may or may not attain similar outcomes independent of the individuals'
memberships in the subpopulation. The traditional graphical methods for
visualizing inequities are known as "reliability diagrams" or "calibration
plots," which bin the scores into a partition of all possible values, and for
each bin plot both the average outcomes for only individuals in the
subpopulation as well as the average outcomes for all individuals; comparing
the graph for the subpopulation with that for the full population gives some
sense of how the averages for the subpopulation deviate from the averages for
the full population. Unfortunately, real data sets contain only finitely many
observations, limiting the usable resolution of the bins, and so the
conventional methods can obscure important variations due to the binning.
Fortunately, plotting cumulative deviation of the subpopulation from the full
population as proposed in this paper sidesteps the problematic coarse binning.
The cumulative plots encode subpopulation deviation directly as the slopes of
secant lines for the graphs. Slope is easy to perceive even when the constant
offsets of the secant lines are irrelevant. The cumulative approach avoids
binning that smooths over deviations of the subpopulation from the full
population. Such cumulative aggregation furnishes both high-resolution
graphical methods and simple scalar summary statistics (analogous to those of
Kuiper and of Kolmogorov and Smirnov used in statistical significance testing
for comparing probability distributions).
Comment: 70 pages, 51 figures, 2 tables; the new versions of the paper merge
in most of arXiv:2006.0250
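A minimal sketch of the cumulative construction, assuming a hypothetical helper expected_at_score that returns the full-population average outcome at each score (the paper specifies the precise construction and its normalizations):

```python
import numpy as np

def cumulative_deviation(scores_sub, outcomes_sub, expected_at_score):
    """Cumulative deviation of a subpopulation from the full population
    (illustrative sketch). Plotted against rank (or score), deviations
    over a range of scores appear as slopes of secant lines.
    """
    scores_sub = np.asarray(scores_sub, dtype=float)
    outcomes_sub = np.asarray(outcomes_sub, dtype=float)
    order = np.argsort(scores_sub)
    dev = outcomes_sub[order] - expected_at_score(scores_sub[order])
    return np.cumsum(dev) / len(dev)
```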